IMPACT OF COVERAGE ON THE RELIABILITY
OF A FAULT TOLERANT COMPUTER 

Salvatore J. Bavuso 
Langley Research Center 

SUMMARY 

A mathematical reliability model is established for a reconfigurable fault tolerant 
avionic computer system utilizing state-of-the-art computers. System reliability is stud- 
ied in light of the coverage probabilities associated with the first and second independent 
hardware failures. Coverage models are presented as a function of detection, isolation, 
and recovery probabilities. Upper and lower bounds are established for the coverage prob- 
abilities and the method for computing values for the coverage probabilities is investigated. 
Further, an architectural variation is proposed which is shown to enhance coverage. 

INTRODUCTION 

In recent years, the literature has contained numerous fault tolerant computer 
architectural designs which are enumerated in reference 1. What is strikingly apparent 
from the majority of those reported is the usual presentation of a cursory reliability 
assessment with comparable heuristic justification, if any assessment at all is present. 

Early attempts to arrive at realistic reliability estimates for computer systems 
appear to be due to Roth and Bouricius, et al. (ref. 2). With their presentation of the prob- 
abilistic concept of coverage, it was shown that coverage, defined as the conditional prob- 
ability that a proper recovery occurs if a fault exists, must approach 100 percent to gain 
the potential reliability attainable by modular replacement systems (ref. 3). Before that
work, reliability analyses had implicitly assumed a coverage of unity by omitting this concept
from the reliability equations.

The application of the coverage concept to contemporary computer systems was 
reported by Sklaroff et al. (ref. 4). They express coverage for a two-fault tolerant triplex 
configuration as two components, a coverage component for the first failure and a coverage 
component for the second failure. A reliability comparison between three triplex systems 
with different failure coverage components is presented. Triplex system A is assigned a
first failure coverage of unity and a second failure coverage of $X$ such that $0.5 \le X \le 1$.
Triplex system C is assigned the coverage probability $X$, $0.5 \le X \le 1$, for both first and
second failures. Systems A and C are master slave architectures; and system B, which assumes
a first failure coverage of unity and a second failure coverage $X$, $0.5 \le X \le 1$, is a
configuration in which all computers issue outputs to an external unit. The work presented
in this paper addresses the latter system similarly to the analysis that was performed for
triplex system C, with the addition of establishing upper and lower bounds on the first and
second failure coverages, $C_1$ and $C_2$, and a method for computing values for $C_1$ and
$C_2$ is investigated. In order to make the results realistic, the fault tolerant avionic flight
control computer system utilized for this study is composed of three identical contempo- 
rary simplex computers. 


SYMBOLS

$A$                          event, A channel is operational
${}^{d}A_o$                  event, channel A detects a fault in the other channel (B)
${}^{d}A_s$                  event, channel A detects a fault in itself
$B$                          event, B channel is operational
${}^{d}B_o$                  event, channel B detects a fault in the other channel (A)
${}^{d}B_s$                  event, channel B detects a fault in itself
$C_j$                        probability of the system defined in figure 2 entering state $S_j$ given that
                             the system was previously in state $S_{j-1}$ and that an unrepairable fault
                             occurred in a channel, where $1 \le j \le 2$; failure coverage
$D$                          fault detection event
$\bar{D}_f$                  event, no detection of a fault
${}^{c}_{j}D_f$              event, correct detection of a fault for subset j; j = 1, single failure simplex
                             isolating; j = 2, single failure cross isolating
${}^{\bar{c}}D_f$            event, incorrect detection of a fault
$F$                          fault event
$I$                          fault isolation event
I                            input unit composed of ADC's
i,j                          integers
O                            output unit composed of DAC's
$P(\ )$                      probability of event ( ) occurring
$P_j(t)$                     probability of the system being in state j at time t
$P_j(D,I,R_c \mid F)$        conditional probability the system will detect, isolate, and reconfigure
                             and recover given that the jth fault occurred
${}^{d}P_j = P_j(D \mid F)$  conditional probability the system will detect a fault given that the
                             jth fault occurred
${}^{i}P_j = P_j(I \mid D,F)$  conditional probability the system will isolate (to a channel) a fault
                             given that the jth fault was detected and the fault occurred
${}^{ik}P_j$                 isolation probability in channel k in state j
${}^{r}P_j = P_j(R_c \mid D,I,F)$  conditional probability the system will reconfigure and recover
                             given that the jth fault was detected and isolated and the fault occurred
${}^{s}P_2$                  isolation probability which is identical for both duplex channels
$P_{sf}$                     probability of system failure
$Q$                          unreliability given by $1 - R$
$Q_A$                        unreliability of channel A, $1 - R_A$
$Q_B$                        unreliability of channel B, $1 - R_B$
$R$                          reliability given by $\exp(-\lambda t)$
$R_A$                        reliability of channel A
$R_B$                        reliability of channel B
$R_c$                        reconfiguration event
${}^{j}\bar{R}$              event, the system fails to recover upon channel failure while attempting to
                             reconfigure from j channels to j-1 channels
$S_j$                        jth state of the triplex system
$t$                          time
$X$                          coverage probability defined in reference 4
$X_i$                        ith operational channel
$\bar{X}_i$                  ith malfunctioned channel
$\lambda$                    constant hardware failure rate of simplex computer
$\triangleq$                 defined as
$\ni$                        such that
$\bar{\ }$                   complementary event, for example, the complement of event $A$ is $\bar{A}$
$\mid$                       conditional event
$\oplus$                     exclusive "or" operation
$+$                          inclusive "or" operation
$\cap$                       Boolean "and" operation
$\cup$                       Boolean "or" operation
$\gg$                        much greater than

Abbreviations:

ADC                          analog-to-digital converter
BITE                         built-in test equipment
CSC                          contemporary simplex computer
DAC                          digital-to-analog converter
MTTF                         mean time to failure
RCS                          reconfigurable computer system


FLIGHT CONTROL COMPUTER RELIABILITY MODEL 

The computer architecture selected for this study appears in figure 1 as a triplex 
reconfigurable computer system (RCS) composed of three identical computer channels. 

The contemporary simplex computer (CSC) contained within each channel is a typical
aerospace class machine with a 16 000 word memory and a memory add time of 2 μs. In an
aircraft environment, the CSC mean time to failure (MTTF) is predicted to be 3275 hr.

The I unit is composed of 20 analog-to-digital converters (ADC's) with a combined
MTTF of 4000 hr and the O unit is composed of 20 digital-to-analog converters (DAC's)
with a combined MTTF of 5500 hr. A channel composed of a CSC, an I unit, and an
O unit is assigned a predicted MTTF of 1357 hr which reflects the environmental effects
of an operational aircraft. A mission time of 10 hr is assumed; also channel voting/
comparing is performed by software via an interprocessor bus.
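
The 1357 hr channel figure is consistent with treating the CSC, the I unit, and the O unit as a series chain of elements with constant failure rates, in which case the reciprocal MTTF values add. The text does not state how the value was obtained, so the series assumption and the helper name `series_mttf` in the following sketch are illustrative only.

```python
def series_mttf(*component_mttfs_hr):
    """MTTF (hr) of a series chain, assuming constant, independent failure rates."""
    total_rate = sum(1.0 / mttf for mttf in component_mttfs_hr)  # failures per hour
    return 1.0 / total_rate


# CSC, I unit (20 ADC's), and O unit (20 DAC's) MTTF values quoted in the text.
channel_mttf = series_mttf(3275.0, 4000.0, 5500.0)
channel_failure_rate = 1.0 / channel_mttf  # lambda, per hour

print(f"channel MTTF ~ {channel_mttf:.0f} hr")
print(f"lambda * 10 hr = {channel_failure_rate * 10.0:.5f}")
```

Under this assumption the combined value lands within about 1 hr of the quoted 1357 hr.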

The Markov state space modeling technique was selected to represent a mathematical
reliability model for the described system. Reference 5 describes the theoretical
basis for this technique. A Markov state space model of the triplex channel computer
system with coverage factors for both first and second channel failures was developed and
is presented in figure 2. The figure defines four system states of interest, states $S_0$ to
$S_3$, where event $X_i$ is defined as the ith operational channel and event $\bar{X}_i$ is the ith
malfunctioned channel, $1 \le i \le 3$. State $S_0$ is the condition where all channels are
operational and is expressed as the Boolean product of three events, $S_0 = X_1 X_2 X_3$. State $S_3$
is the system-failure state and occurs when the system does not recover upon a channel
failure while attempting to reconfigure from 3 channels to 2 channels (event ${}^{3}\bar{R}$),
similarly does not recover upon a channel failure while in the dual state (event ${}^{2}\bar{R}$),
or all channels fail. The figures between state nodes are transitional probabilities composed
of coverage components $C_1$ and $C_2$, $\lambda$, $\Delta t$, and a constant. The parameters $C_1$ and
$C_2$ are defined as the probability of the system entering state $S_1$ and $S_2$ given that the
system was previously in state $S_0$ and $S_1$, respectively, and that an unrepairable fault
occurred in a channel. Repairable faults, such as benign electrical transient faults, do not
cause a change of system state, since this class of faults can be mitigated by machine state
vector transfer or software rollback. Unrepairable transients, such as intermittent and long
duration faults, do cause a change of system state. Their effects can be incorporated by
summing the unrepairable transient fault rate (assumed constant with time) with the channel
hardware failure rate. The parameter $\lambda$ is the channel hardware failure rate and is
assumed to be constant for this model. It is the reciprocal of the MTTF,
$\lambda = 1/\mathrm{MTTF}$. The constant multiplier in the transitional probability term relates to the
number of operational channels prior to failure and $\Delta t$ is an increment of time.

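The RCS model can be expressed as a system of first-order ordinary differential
equations (discrete state, continuous time)

\[
\begin{aligned}
\frac{dP_0(t)}{dt} &= -3\lambda P_0(t) \\
\frac{dP_1(t)}{dt} &= 3\lambda C_1 P_0(t) - 2\lambda P_1(t) \\
\frac{dP_2(t)}{dt} &= 2\lambda C_2 P_1(t) - \lambda P_2(t) \\
\frac{dP_3(t)}{dt} &= 3\lambda (1 - C_1) P_0(t) + 2\lambda (1 - C_2) P_1(t) + \lambda P_2(t)
\end{aligned}
\]

where $P_j$ is the probability of the system being in state $S_j$, that is

\[ P_0 = P(S_0) \qquad P_1 = P(S_1) \qquad P_2 = P(S_2) \qquad P_3 = P(S_3) \]

and the initial conditions are

\[ P_0(0) = 1 \qquad P_1(0) = P_2(0) = P_3(0) = 0 \]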
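
As a numerical cross-check, the four state equations can be integrated directly from the stated initial conditions. The sketch below uses a classical fourth-order Runge-Kutta step; the function names, the step count, and the choice of integrator are assumptions made for illustration and are not part of the original analysis. With $C_1 = C_2 = 1$ the only route to $S_3$ is exhaustion of all three channels, so $P_3$ should equal $(1 - e^{-\lambda t})^3$.

```python
import math


def rcs_derivatives(p, lam, c1, c2):
    """Right-hand sides of the four state equations for (P0, P1, P2, P3)."""
    p0, p1, p2, _p3 = p
    return (
        -3.0 * lam * p0,
        3.0 * lam * c1 * p0 - 2.0 * lam * p1,
        2.0 * lam * c2 * p1 - lam * p2,
        3.0 * lam * (1.0 - c1) * p0 + 2.0 * lam * (1.0 - c2) * p1 + lam * p2,
    )


def integrate_rcs(lam, c1, c2, t_end, steps=1000):
    """Integrate from P(0) = (1, 0, 0, 0) with classical fourth-order Runge-Kutta."""
    h = t_end / steps
    p = (1.0, 0.0, 0.0, 0.0)

    def shifted(state, slope, scale):
        return tuple(x + scale * k for x, k in zip(state, slope))

    for _ in range(steps):
        k1 = rcs_derivatives(p, lam, c1, c2)
        k2 = rcs_derivatives(shifted(p, k1, 0.5 * h), lam, c1, c2)
        k3 = rcs_derivatives(shifted(p, k2, 0.5 * h), lam, c1, c2)
        k4 = rcs_derivatives(shifted(p, k3, h), lam, c1, c2)
        p = tuple(
            x + (h / 6.0) * (a + 2.0 * b + 2.0 * c + d)
            for x, a, b, c, d in zip(p, k1, k2, k3, k4)
        )
    return p


lam = 1.0 / 1357.0  # channel failure rate per hour, from the 1357 hr channel MTTF
p_final = integrate_rcs(lam, c1=1.0, c2=1.0, t_end=10.0)
exhaustion = (1.0 - math.exp(-lam * 10.0)) ** 3  # all three channels fail
print(p_final[3], exhaustion)
```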

The derivation of these equations can be found by inspection of the graph in figure 2 
by utilizing the following analog: From signal flow graph theory, the probability of the 
system being in state Sj is the analog of a signal source and the transition probability is 
the analog of a transmission gain. The probability of the system being in state Sj at 
time t + At is the sum of all signals arriving at the Sj node. The other nodes behave 
as probability sources at time t. For example, the ordinary differential equation asso- 
ciated with state zero is given by 

\[ P_0(t + \Delta t) = P_0(t) - 3\lambda C_1 \Delta t\, P_0(t) - 3\lambda (1 - C_1) \Delta t\, P_0(t) \]

Rearranging terms and taking a limit gives

\[ \lim_{\Delta t \to 0} \frac{P_0(t + \Delta t) - P_0(t)}{\Delta t} = \frac{dP_0(t)}{dt} = -3\lambda P_0(t) \]


The solution to the system of differential equations was derived analytically and is pre- 
sented as follows: 


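\[
\begin{aligned}
P_0(t) &= e^{-3\lambda t} \\
P_1(t) &= 3 C_1 \left( e^{-2\lambda t} - e^{-3\lambda t} \right) \\
P_2(t) &= 3 C_1 C_2 \left( e^{-\lambda t} - 2 e^{-2\lambda t} + e^{-3\lambda t} \right) \\
P_3(t) &= 1 - \left[ P_0(t) + P_1(t) + P_2(t) \right]
\end{aligned}
\]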

The results of the probability of system failure $P_{sf}$ at 10 hr of mission time as a function
of $C_1$ and $C_2$ are plotted in figure 3 for $P_3$. In the best case when $C_1 = C_2 = 1$,
$P_{sf}$ is predicted at $3.96 \times 10^{-7}$ at 10 hr of mission time. For $C_1$ and $C_2$ less than
unity, $P_{sf}$ increases as $C_2$ decreases; since $P_3$ is affine in $C_2$, the increase is linear in
$1 - C_2$ for fixed $C_1$.
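
The analytic solution can be evaluated directly. The following sketch assumes $\lambda = 1/1357$ per hour and a 10 hr mission, as given in the text; the function name `state_probabilities` is hypothetical. The coverage pairs other than $C_1 = C_2 = 1$ are the ones discussed later in the paper, where the quoted failure probabilities are read from figure 3, so the computed values should be expected to agree with them only approximately.

```python
import math


def state_probabilities(lam, c1, c2, t):
    """Closed-form (P0, P1, P2, P3) of the triplex Markov model at time t."""
    e1 = math.exp(-lam * t)
    e2 = math.exp(-2.0 * lam * t)
    e3 = math.exp(-3.0 * lam * t)
    p0 = e3
    p1 = 3.0 * c1 * (e2 - e3)
    p2 = 3.0 * c1 * c2 * (e1 - 2.0 * e2 + e3)
    p3 = 1.0 - (p0 + p1 + p2)  # system-failure state
    return p0, p1, p2, p3


LAM = 1.0 / 1357.0   # per hour
MISSION_HR = 10.0

for c1, c2 in [(1.0, 1.0), (0.9999, 0.95), (0.9999, 0.998)]:
    p_sf = state_probabilities(LAM, c1, c2, MISSION_HR)[3]
    print(f"C1={c1}, C2={c2}: P_sf = {p_sf:.2e}")
```

Computing $P_3$ as one minus the other three probabilities mirrors the form of the solution; for much smaller failure probabilities a direct summation of the failure terms would be less prone to cancellation.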

Intuitively, it is obvious that prior to fault recovery via reconfiguration, a fault must
be detected and located by the computer system. It will be seen later that both of these
factors can be incorporated in the computation of $C_1$ and $C_2$. During initial operation,
however, three computers are available for fault detection and isolation and, therefore, it
is expected that $C_1$ will be very nearly unity; whereas, after the first channel failure,
fault isolation must be accomplished without a majority vote and is expected to cause $C_2$
to be much less than unity. The literature is extremely sparse in predicting values for
$C_j$; however, reference 4 indicates $C_2 = 0.95$ at the present state of the art (assuming
perfect recovery from transient faults and perfect software in a correctly designed
system).

When $C_1 \le 0.999$, an interesting phenomenon occurs. The data (observed in fig. 3)
indicate that $P_{sf}$ becomes insensitive to $C_2$, that is, for some $C_1 < 0.999$, $P_{sf}$
becomes insensitive to changes in $C_2$. The implication is that if $C_1$ is not sufficiently
greater than 0.999, the achievement of high $C_2$ is unimportant; hence $C_1 = 0.999$ is
assigned as a reasonable lower bound for this computer architecture. In view of the fact
that hitherto $C_1$ has been essentially ignored by the usual assumption of $C_1 = 1$, it
appears that attention should be focused on determining realistic values of $C_1$.
Alternately, when $C_1 > 0.999$, the data show that high gains in system reliability can be
approached only if $C_2 > 0.94$; hence, $C_2 = 0.94$ is assigned as a lower bound. The
data depicted in figure 3 show that for $C_2 = 0.996$, little gain in reliability occurs for
$C_1 > 0.99999$ (note superposition of such curves). The coverage value, $C_1 = 0.99999$,
may be assigned to this model as a reasonable upper bound, that is, the achievement of
$C_1$ coverages greater than 0.99999 contributes little for this contemporary system since
it is unlikely that values of $C_2 \gg 0.996$ can be achieved as is shown later.
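
The insensitivity to $C_2$ at low $C_1$ can be illustrated from the analytic solution. The sketch below compares the failure probability at $C_2 = 0.95$ and $C_2 = 0.998$ for several values of $C_1$; the helper names and the sample value $C_1 = 0.99$ are assumptions chosen for illustration, and the figure 3 curves themselves are not reproduced here.

```python
import math


def p_system_failure(c1, c2, lam=1.0 / 1357.0, t=10.0):
    """P3(t) from the closed-form solution of the triplex Markov model."""
    e1, e2, e3 = (math.exp(-k * lam * t) for k in (1, 2, 3))
    return 1.0 - (e3 + 3.0 * c1 * (e2 - e3) + 3.0 * c1 * c2 * (e1 - 2.0 * e2 + e3))


def gain_from_c2(c1, c2_low=0.95, c2_high=0.998):
    """Factor by which P_sf drops when C2 is raised from c2_low to c2_high."""
    return p_system_failure(c1, c2_low) / p_system_failure(c1, c2_high)


for c1 in (0.99, 0.999, 0.9999, 0.99999):
    print(f"C1={c1}: improvement factor {gain_from_c2(c1):.2f}")
```

When $C_1$ is low, the uncovered first-failure term $3\lambda(1 - C_1)$ dominates $P_{sf}$ and improving $C_2$ changes the result very little; as $C_1$ approaches unity the same improvement in $C_2$ yields a much larger reduction.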

$C_j$ COMPUTATION

In the previous discussion, a mathematical relationship between RCS $P_{sf}$ and
coverage components was developed. Theoretical coverage bounds were established for 
system first and second failure coverage components. This section investigates the 
coverage contribution of simplex computer channels to the first and second failure cover- 
age components. 

The jth failure coverage $C_j$ may be defined as

\[ C_j \triangleq P_j(D, I, R_c \mid F) \]

that is, the jth failure coverage is the probability that the system will detect a fault,
isolate the fault to a channel, and reconfigure and recover given that a fault occurred. Since
$R_c$ is dependent on $D$, $I$, and $F$; $I$ is dependent on $D$ and $F$; and $D$ is dependent
on $F$, $C_j$ may be further defined as the product of conditional probabilities. (See
appendix A for derivation.)

\[ C_j = P_j(R_c \mid D, I, F) \cdot P_j(I \mid D, F) \cdot P_j(D \mid F) \]

Further

\[ C_j = {}^{r}P_j \cdot {}^{i}P_j \cdot {}^{d}P_j \]

where

\[ {}^{r}P_j = P_j(R_c \mid D, I, F) \qquad {}^{i}P_j = P_j(I \mid D, F) \qquad {}^{d}P_j = P_j(D \mid F) \]
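
A minimal sketch of the product form follows. The function name `failure_coverage` and its range check are illustrative assumptions; the sample values (perfect detection and recovery with an isolation probability of 0.95) are the best-case duplex values used later in the text.

```python
def failure_coverage(p_detect, p_isolate, p_recover):
    """C_j as the product of the detection, isolation, and recovery probabilities."""
    for name, value in (
        ("p_detect", p_detect),
        ("p_isolate", p_isolate),
        ("p_recover", p_recover),
    ):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must lie in [0, 1], got {value}")
    return p_recover * p_isolate * p_detect


c2_simplex = failure_coverage(p_detect=1.0, p_isolate=0.95, p_recover=1.0)
print(c2_simplex)
```

Because the three factors multiply, the coverage can never exceed the smallest of them, which is why the weakest of detection, isolation, and recovery sets the ceiling on $C_j$.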
For the first failure coverage

\[ C_1 = {}^{r}P_1 \cdot {}^{i}P_1 \cdot {}^{d}P_1 \]

This equation expresses the events and their occurrence probabilities associated with the
computer system's ability to traverse from the triplex state $S_0$ to the duplex state $S_1$
as a result of a permanent channel failure. Failure detection and isolation can be accom- 
plished by channel majority voting. After this process, accomplished primarily by soft- 
ware, is completed, the values of and are determined essentially by the 

correctness of the system hardware and software design and the correctness of the soft- 
ware code. The utility of software self-testing or BITE (built-in test equipment) is less- 
ened by the massive hardware channel redundancy. However, since 100-percent hardware 
and software design verification and code correctness verification are still unachievable, 
$C_1$ is most likely less than unity. A simple example to demonstrate this point regards
the common practice of inserting identical copies of software into each CSC. In most 
cases, operational software contains latent software errors, errors not discovered during 
the software debugging and testing process. Such errors will never be detected by the 
majority voting process. The consequences of this type of error occurrence can be 
devastating to an aircraft which utilizes this computer system as the sole flight control 
system computer. 

The availability of massive channel redundancy, however, does not obviate the need 
for BITE and software self-test in the triplex state $S_0$ since BITE is a hardware design
implementation and, as such, is somewhat independent of software design. For instance, 
a latent software error although not detected by majority voting may trigger a BITE detec- 
tor indicating, for example, an overflow condition. Similarly, software self-testing should 
not be abandoned either since latent hardware faults and transient -caused faults (perma- 
nently altered unprotected memory) can be detected and perhaps corrected prior to the 
execution of certain critical applications programs such as end of mission (autoland) 
programs. Thus, when hardware and software design and software coding are considered 
correct, it is reasonable to assume that 



Of greater interest is the case where j = 2, and the second failure coverage is
given by

\[ C_2 = {}^{r}P_2 \cdot {}^{i}P_2 \cdot {}^{d}P_2 \]

For this case the probability of isolation becomes a predominant factor for $C_2$ since
intuitively one recognizes that there is a high probability of detection by comparison; and
if a fault could be isolated to a simplex computer, it is reasonable to believe there is a
high probability of the system effecting a proper recovery. Assuming in the best case
that ${}^{d}P_2 = {}^{r}P_2 = 1$, $C_2$ is studied in light of ${}^{i}P_2$.

For j = 2, the second failure coverage is expressed by

\[ C_2 = {}^{i}P_2 \]

where

\[ {}^{d}P_2 = 1 \qquad {}^{i}P_2 \le 1 \qquad {}^{r}P_2 = 1 \]

In the duplex mode, ${}^{i}P_2$ is based on the simplex computer failure detection probability
which is a function of the isolation test thoroughness, testing time, and BITE detecting
effectiveness.

A system architecture that restricts software testing to a single simplex computer 
such that each simplex machine is capable of determining its own health is defined as a 
simplex isolating architecture. With this type of architecture configured in the duplex 
mode, the probability of isolation is identical to the probability of detection in a simplex 
computer. This conclusion is demonstrated in appendix B. Utilizing the state-of-the-art 
value for fault detection in a simplex computer which is given in reference 4 as 0.95, 

${}^{i}P_2 = 0.95$ and, therefore, $C_2 = 0.95$. For $C_1 = 0.9999$ and $C_2 = 0.95$, figure 3
indicates a $P_{sf}$ of $1.2 \times 10^{-5}$ as a reasonable state-of-the-art goal. This value for $P_{sf}$
contrasts against the theoretical minimum of $3.96 \times 10^{-7}$. An interesting variation on
determining the system isolation probability, which to this author’s knowledge has not been 
discussed in the literature, is to remove the restriction of self-testing to a single simplex 
computer. By allowing each simplex computer access to the other's registers, each 
machine can test itself as well as the other (cross isolating architecture). In this case, 
each machine can be conceptually considered as a fault detector searching for a fault in 
the union of the two simplex computer fault sets. The union of the fault sets becomes the 
universal fault set; and since the isolation events are independent 

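\[ {}^{i}P_2 = {}^{i1}P_2 + {}^{i2}P_2 - {}^{i1}P_2 \cdot {}^{i2}P_2 \]

where ${}^{ik}P_2$ is the isolation probability in machine number k for $1 \le k \le 2$. The
rationale for this conclusion is presented in appendix B. On letting

\[ {}^{i1}P_2 = {}^{i2}P_2 = {}^{s}P_2 \]

then

\[ {}^{i}P_2 = 2\,{}^{s}P_2 - \left({}^{s}P_2\right)^2 \]

and

\[ C_2 = {}^{i}P_2 \]

since

\[ {}^{d}P_2 = {}^{r}P_2 = 1 \]

by assumption. For

\[ {}^{s}P_2 = 0.95 \qquad C_2 = 0.998 \]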
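
The cross isolating result is the probability that at least one of two independent isolators succeeds. A short sketch, with the hypothetical helper name `cross_isolation_probability`, reproduces the quoted value from the 0.95 simplex isolation probability.

```python
def cross_isolation_probability(p_machine_1, p_machine_2):
    """Probability that at least one of two independent isolators succeeds."""
    return p_machine_1 + p_machine_2 - p_machine_1 * p_machine_2


p_simplex = 0.95  # state-of-the-art simplex value quoted from reference 4
p_cross = cross_isolation_probability(p_simplex, p_simplex)

# Perfect detection and recovery are assumed, so C2 equals the isolation probability.
c2_cross = 1.0 * p_cross * 1.0
print(f"{p_cross:.4f}")
```

The same value follows from $1 - (1 - 0.95)^2$, that is, one minus the probability that both machines miss the fault.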

Using $C_1 = 0.9999$ and $C_2 = 0.998$, the probability of system failure approaches
$2.8 \times 10^{-6}$ at 10 hr of mission time which is contrasted against the theoretical minimum
of $3.96 \times 10^{-7}$ at 10 hr when $C_1 = C_2 = 1$.

On observing figure 3, the upper bound for $C_1$ in a simplex isolating architecture
is 0.99999, since for $C_2 \le 0.95$, all the curves for $C_1 \ge 0.99999$ are superimposed.
For cross isolating architectures by contrast, the upper bound for $C_1$ is 0.999999 since
$C_2$ is likely to approach 0.998.

When ${}^{r}P_2$ is assumed as a variable in a cross isolating architecture

\[ C_2 = {}^{r}P_2 \left[ 2\,{}^{s}P_2 - \left({}^{s}P_2\right)^2 \right] \]

Figure 4 depicts the sensitivity of $C_2$ to ${}^{r}P_2$ and ${}^{s}P_2$. The data show that $C_2$ is
considerably more sensitive to changes in ${}^{r}P_2$ than to ${}^{s}P_2$; in fact, for reasonably
obtainable values of ${}^{s}P_2$ ($0.9 < {}^{s}P_2 < 0.95$), $C_2$ is nearly completely determined by
${}^{r}P_2$. This observation suggests that considerable effort be devoted toward improving
${}^{r}P_2$ rather than ${}^{s}P_2$ when ${}^{s}P_2 > 0.95$ and the computer architecture is a cross
isolating architecture.
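
The relative sensitivity can be sketched with partial derivatives, assuming perfect detection so that $C_2 = {}^{r}P_2 \left[ 2\,{}^{s}P_2 - ({}^{s}P_2)^2 \right]$. This relation, the function names, and the evaluation points are assumptions made for illustration; figure 4 itself is not reproduced.

```python
def c2_cross_isolating(p_recover, p_isolate_simplex):
    """Second failure coverage for a cross isolating duplex, detection assumed perfect."""
    return p_recover * (2.0 * p_isolate_simplex - p_isolate_simplex ** 2)


def sensitivities(p_recover, p_isolate_simplex):
    """Partial derivatives of C2 with respect to recovery and simplex isolation."""
    d_recover = 2.0 * p_isolate_simplex - p_isolate_simplex ** 2
    d_isolate = p_recover * (2.0 - 2.0 * p_isolate_simplex)
    return d_recover, d_isolate


for p_iso in (0.90, 0.95):
    d_r, d_i = sensitivities(1.0, p_iso)
    print(f"sP2={p_iso}: dC2/d(rP2)={d_r:.4f}, dC2/d(sP2)={d_i:.4f}")
```

Near an isolation probability of 0.95 the slope with respect to recovery is roughly ten times the slope with respect to isolation, which matches the observation that $C_2$ is governed mainly by the recovery probability in this range.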
CONCLUDING REMARKS


A mathematical model was established for the reliability of a reconfigurable fault 
tolerant avionic computer system utilizing contemporary simplex computers. The system 
reliability was computed as a function of a particular system configuration, mean time to 
failure for a contemporary simplex computer, mission time, and two coverage parameters, 
one associated with the first independent hardware failure Cj and the other with the 
second failure C2. 

Two variations of a duplex configuration were addressed and termed, simplex iso- 
lating architecture and cross isolating architecture. The former system architecture 
restricts software testing to a single simplex computer such that each simplex machine 
is capable of determining its own health. The latter architecture proposed by the author 
removes the restriction regarding software testing to a single simplex computer such 
that it allows each simplex computer access to the other’s registers, enabling each 
machine to test itself as well as the other. 

A lower bound for the first failure coverage was established at 0.999. When
$C_1 \le 0.999$, the system probability of failure becomes independent of the second failure
coverage $C_2$ so that if $C_1$ is not sufficiently greater than 0.999, the achievement of
high $C_2$ is unimportant. This result suggests that more attention be focused on determining
values of $C_1$.

For a simplex isolating architecture, where values of $C_2$ will probably be less
than or equal to 0.95, an upper bound for $C_1$ of 0.99999 was established. For cross
isolating architectures where $C_2$ approaches 0.998, the upper bound for $C_1$ appears
to be 0.999999. When $C_1 > 0.999$, the model data predict that high gains in system
reliability can be approached only if second failure coverage values are much larger than
0.94; therefore, $C_2 = 0.94$ is assigned as a reasonable lower bound for $C_2$. The upper
bound for $C_2$ for the described triplex system is 0.998.

A model for computing $C_1$ and $C_2$ was proposed for a cross isolating architecture,
and an estimate for $C_2$ was calculated to be 0.998 when perfect detection and
recovery in the duplex configuration is assumed and the probability of isolating a fault
${}^{s}P_2$ is given as 0.95. Assuming $C_1 = 0.9999$ and $C_2 = 0.998$, the probability of system
failure approaches $2.8 \times 10^{-6}$ at 10 hr which is contrasted against the theoretical minimum
of $3.96 \times 10^{-7}$ at 10 hr when $C_1 = C_2 = 1$. The coverage estimates, primarily attributed
to $C_2 < 1$, appear to increase the probability of system failure by an order of magnitude.
Further, it was shown that the major contributor to $C_2$, for a cross isolating architecture,
is the probability of reconfiguring and recovery rather than the failure isolation
probability, which suggests that considerable effort be devoted toward improving
${}^{r}P_2$ rather than ${}^{s}P_2$ (for ${}^{s}P_2 > 0.9$).

Finally, it should be noted that the modeling techniques developed in this paper are
easily modified to study the case of three failures or greater tolerant systems and may
be utilized to predict reliabilities for higher order systems.

Langley Research Center

National Aeronautics and Space Administration
Hampton, Va. 23665
June 9, 1975

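APPENDIX A


$C_j$ DERIVATION

The following derivation is a straightforward application of conditional probabilities.

\[ C_j \triangleq P_j(D, I, R_c \mid F) = P_j(R_c, D, I \mid F), \qquad D \cap I \cap R_c = R_c \cap D \cap I \]

by commutativity.

\[
\begin{aligned}
C_j = P_j(R_c, D, I \mid F) &= \frac{P_j(R_c, D, I, F)}{P_j(F)} \\
P_j(R_c, D, I, F) &= P_j(R_c \mid D, I, F)\, P_j(D, I, F) \\
P_j(D, I, F) = P_j(I, D, F) &= P_j(I \mid D, F)\, P_j(D, F) \\
P_j(D, F) &= P_j(D \mid F)\, P_j(F) \\
P_j(R_c, D, I, F) &= P_j(R_c \mid D, I, F)\, P_j(I \mid D, F)\, P_j(D \mid F)\, P_j(F) \\
C_j &= \frac{P_j(R_c \mid D, I, F)\, P_j(I \mid D, F)\, P_j(D \mid F)\, P_j(F)}{P_j(F)} \\
C_j &= P_j(R_c \mid D, I, F)\, P_j(I \mid D, F)\, P_j(D \mid F)
\end{aligned}
\]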
APPENDIX B


PROBABILITY OF ISOLATION DERIVATION

In the duplex mode, ${}^{i}P_2$ is a function of the health of each machine, A and B,
which can be modeled by the Poisson reliability model, $R = e^{-\lambda t}$, the ability of each
machine to detect a fault in itself, ${}^{d}A_s$ and ${}^{d}B_s$, and the ability of each machine to
detect a fault in the other machine, ${}^{d}A_o$ and ${}^{d}B_o$. By allowing each event to be a
binary event, there are $2^6 = 64$ possible combination system states depicted in table 1
which represent the universal sample space. By definition, the duplex system probability
of fault detection is assigned unity; therefore, the subset of sample points which contains
one or more failures (fault subset) is used to determine the probability of isolation. The
fault subset may be partitioned to form 12 subsets of interest which are depicted as
follows:

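                                          Subset for -
Failure    Detection outcome       Simplex isolating    Cross isolating
Single     Correct detection               1                   2
Single     Incorrect detection             3                   4
Single     No detection                    5                   6
Double     Correct detection               7                   8
Double     Incorrect detection             9                  10
Double     No detection                   11                  12

For single failures the three conditional events are written ${}^{c}D_f \mid A \oplus B$ (correct
detection), ${}^{\bar{c}}D_f \mid A \oplus B$ (incorrect detection), and $\bar{D}_f \mid A \oplus B$ (no detection).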


The heading, simplex isolating, defines the dual architecture with respect to detection, in 
that each machine cannot detect a fault in the other machine. The heading, cross isolating, 
removes this restriction. System detection of faults can be either correct, incorrect, or 
no detection may occur at all. The system could experience a single unrepairable channel 
fault or a double fault either simultaneously or nearly so. 

Subsets 1 and 2 are of particular interest since they represent the case in which the 
machine fault detectors are so designed that if they announce the detection of a fault, then 
the fault physically exists. It is postulated that if a processor is capable of announcing the
existence of a fault then the fault is real. Subsets 3 and 4 for single failures cover the
cases where phantom faults are announced. Only subsets 1 and 2 are considered in this 
paper. 

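The sample points for ${}^{c}D_f \mid A \oplus B$ appear in table 1 as $S_5$, $S_{10}$, $S_{18}$, $S_{26}$,
$S_{33}$, and $S_{37}$. These events are represented by the following equations:

\[
\begin{aligned}
S_5 &= \bar{A}\, B\, {}^{d}A_s\, \overline{{}^{d}A_o}\, \overline{{}^{d}B_s}\, \overline{{}^{d}B_o} \\
S_{10} &= A\, \bar{B}\, \overline{{}^{d}A_s}\, {}^{d}A_o\, \overline{{}^{d}B_s}\, \overline{{}^{d}B_o} \\
S_{18} &= A\, \bar{B}\, \overline{{}^{d}A_s}\, \overline{{}^{d}A_o}\, {}^{d}B_s\, \overline{{}^{d}B_o} \\
S_{26} &= A\, \bar{B}\, \overline{{}^{d}A_s}\, {}^{d}A_o\, {}^{d}B_s\, \overline{{}^{d}B_o} \\
S_{33} &= \bar{A}\, B\, \overline{{}^{d}A_s}\, \overline{{}^{d}A_o}\, \overline{{}^{d}B_s}\, {}^{d}B_o \\
S_{37} &= \bar{A}\, B\, {}^{d}A_s\, \overline{{}^{d}A_o}\, \overline{{}^{d}B_s}\, {}^{d}B_o
\end{aligned}
\]

where, in table 1, the 1 indicates for A and B that the event failed and for ${}^{d}A_s$, ${}^{d}A_o$,
${}^{d}B_s$, and ${}^{d}B_o$ that the event occurred. The 0 indicates for A and B that the event did
not fail and for the others that the event did not occur.

The conditional event ${}^{c}D_f \mid A \oplus B$ is functionally related to the union of sample
points $S_5$, $S_{10}$, $S_{18}$, $S_{26}$, $S_{33}$, and $S_{37}$. By straightforward application of conditional
probabilities

\[
\begin{aligned}
P\left({}^{c}D_f \mid A \oplus B\right) &= \frac{P\left({}^{c}D_f,\, A \oplus B\right)}{P(A \oplus B)} \\
&= \frac{P(S_5 \cup S_{33} \cup S_{37}) + P(S_{10} \cup S_{18} \cup S_{26})}{P(A \oplus B)} \\
&= \frac{P(S_5) + P(S_{33}) + P(S_{37}) + P(S_{10}) + P(S_{18}) + P(S_{26})}{P(A \oplus B)}
\end{aligned}
\]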


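For subset 1, ${}^{c}D_f \mid A \oplus B$ is denoted ${}^{c}_{1}D_f \mid A \oplus B$ and is functionally related
to the union of sample points $S_5$ and $S_{18}$ as follows:

\[ P(S_5) = P(\bar{A})\, P(B)\, P({}^{d}A_s)\, P(\overline{{}^{d}A_o})\, P(\overline{{}^{d}B_s})\, P(\overline{{}^{d}B_o}) \]

and since subset 1 precludes cross detection,

\[ P(\overline{{}^{d}A_o}) = P(\overline{{}^{d}B_o}) = 1 \]

So

\[ P(S_5) = P(\bar{A})\, P(B)\, P({}^{d}A_s)\, P(\overline{{}^{d}B_s}) \]

Further

\[ P(S_{10}) = 0 \]

since

\[ P({}^{d}A_o) = 0 \]

Then

\[ P(S_{18}) = P(A)\, P(\bar{B})\, P(\overline{{}^{d}A_s})\, P({}^{d}B_s) \]

since

\[ P(\overline{{}^{d}A_o}) = P(\overline{{}^{d}B_o}) = 1 \]

Additionally,

\[ P(S_{26}) = 0 \qquad P(S_{33}) = 0 \qquad P(S_{37}) = 0 \]

since

\[ P({}^{d}A_o) = P({}^{d}B_o) = 0 \]

Therefore,

\[ P\left({}^{c}_{1}D_f \mid A \oplus B\right) = \frac{P(S_5) + P(S_{18})}{P(A \oplus B)} \]

or

\[ P\left({}^{c}_{1}D_f \mid A \oplus B\right) = \frac{Q_A R_B\, P({}^{d}A_s)\, P(\overline{{}^{d}B_s}) + R_A Q_B\, P(\overline{{}^{d}A_s})\, P({}^{d}B_s)}{Q_A R_B + Q_B R_A} \]

where $Q_A \triangleq P(\bar{A})$, the unreliability of A; and $R_B \triangleq P(B)$, the reliability of B.
When both machines are identical,

\[ Q_A = Q_B = Q \qquad R_A = R_B = R \]

Since the first term in the numerator is the condition where A failed, $P(\overline{{}^{d}B_s}) = 1$;
likewise in the second term, $P(\overline{{}^{d}A_s}) = 1$, since B failed, and

\[ P\left({}^{c}_{1}D_f \mid A \oplus B\right) = \tfrac{1}{2}\left[ P({}^{d}A_s) + P({}^{d}B_s) \right] \]

If both detection mechanisms are identical,

\[ P\left({}^{c}_{1}D_f \mid A \oplus B\right) = P({}^{d}A_s) = P({}^{d}B_s) \]

The result concludes that the probability of isolation for the duplex system is identical to
that of a simplex machine, that is, ${}^{i}P_2 = {}^{s}P_2$.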

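A more interesting case which appears to have the potential of increasing system
probability of isolation is subset 2. A cross isolating architecture can be physically
effected by allowing each machine access to the other machine's registers, in which case
it is feasible for one machine to diagnose faults in the other. Recalling the assumption
that when a processor announces the detection of a fault, the fault physically exists, the
following conditions occur for cross isolation:

Sample point    Conditions
$S_5$           $P(\overline{{}^{d}A_o}) = P(\overline{{}^{d}B_s}) = 1$
$S_{10}$        $P(\overline{{}^{d}A_s}) = P(\overline{{}^{d}B_o}) = 1$,  $P({}^{d}A_o) \ne 0$
$S_{18}$        $P(\overline{{}^{d}A_s}) = P(\overline{{}^{d}B_o}) = 1$
$S_{26}$        $P(\overline{{}^{d}A_s}) = P(\overline{{}^{d}B_o}) = 1$,  $P({}^{d}A_o) \ne 0$
$S_{33}$        $P(\overline{{}^{d}A_o}) = P(\overline{{}^{d}B_s}) = 1$,  $P({}^{d}B_o) \ne 0$
$S_{37}$        $P(\overline{{}^{d}A_o}) = P(\overline{{}^{d}B_s}) = 1$,  $P({}^{d}B_o) \ne 0$

On applying these conditions for subset 2

\[ P\left({}^{c}_{2}D_f \mid A \oplus B\right) = \frac{P(S_5 \cup S_{10} \cup S_{18} \cup S_{26} \cup S_{33} \cup S_{37})}{Q_A R_B + Q_B R_A} \]

the numerator becomes

\[
\begin{aligned}
& Q_A R_B\, P({}^{d}A_s)\, P(\overline{{}^{d}B_o}) + R_A Q_B\, P({}^{d}A_o)\, P(\overline{{}^{d}B_s}) + R_A Q_B\, P(\overline{{}^{d}A_o})\, P({}^{d}B_s) \\
& \quad + R_A Q_B\, P({}^{d}A_o)\, P({}^{d}B_s) + Q_A R_B\, P(\overline{{}^{d}A_s})\, P({}^{d}B_o) + Q_A R_B\, P({}^{d}A_s)\, P({}^{d}B_o)
\end{aligned}
\]

Allowing both machines to be identical hardware gives

\[
\begin{aligned}
P\left({}^{c}_{2}D_f \mid A \oplus B\right) = \tfrac{1}{2} \big[ & P({}^{d}A_s) P(\overline{{}^{d}B_o}) + P({}^{d}A_o) P(\overline{{}^{d}B_s}) + P(\overline{{}^{d}A_o}) P({}^{d}B_s) \\
& + P({}^{d}A_o) P({}^{d}B_s) + P(\overline{{}^{d}A_s}) P({}^{d}B_o) + P({}^{d}A_s) P({}^{d}B_o) \big]
\end{aligned}
\]

and assuming all detectors have equal detection probabilities gives

\[ P({}^{d}A_o) = P({}^{d}A_s) \qquad P({}^{d}B_o) = P({}^{d}B_s) \]

Therefore,

\[ P\left({}^{c}_{2}D_f \mid A \oplus B\right) = P({}^{d}A_s) + P({}^{d}B_s) - P({}^{d}A_s)\, P({}^{d}B_s) \]

This result is identical except in notation to the text equation

\[ {}^{i}P_2 = {}^{i1}P_2 + {}^{i2}P_2 - {}^{i1}P_2 \cdot {}^{i2}P_2 \]

Further, when $P({}^{d}A_s) = P({}^{d}B_s) = P$

\[ P\left({}^{c}_{2}D_f \mid A \oplus B\right) = 2P - P^2 \]

This result is identical to that contained in the text given as

\[ 2\,{}^{s}P_2 - \left({}^{s}P_2\right)^2 \]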
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TABLE 1.- ALL POSSIBLE SAMPLE POINTS 


0 1 n 3 4 5 6 7 8 9 10 11 12 13 14 15 

A 0101010101010101 

B 0011001100110011 

‘^Ag 0000111100001111 

^Aq 0000000011111111 

"^Bg 0000000000000000 

^Bq 0000000000000000 

32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 

A 0101010101010101 

B 0011001100110011 

^Ag 0000111100001111 

^Aq 0000000011111111 
‘^Bg 0000000000000000 
^Bg 1111111111111111 


16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 

0101010101010101 

0011001100110011 

0000111100001111 

0000000011111111 

1111111111111111 

0000000000000000 

48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 

0101010101010101 

0011001100110011 

0000111100001111 

0000000011111111 

1111111111111111 

1111111111111111 
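
The sample-point numbering of table 1 is a six-bit binary code, so the correct-detection points for a single failure can be recovered by enumeration. The sketch below assumes the bit order read from the table rows (A, B, then the four detection events, least significant bit first) and encodes the appendix's postulate that an announced fault must physically exist; the event names and helper functions are hypothetical.

```python
from itertools import product

# Bit k of the sample-point index holds the value of EVENTS[k] (1 = failed / occurred).
EVENTS = ("A_failed", "B_failed", "dAs", "dAo", "dBs", "dBo")


def decode(n):
    """Expand a sample-point index into its six binary events."""
    return {name: (n >> k) & 1 for k, name in enumerate(EVENTS)}


def is_correct_single_detection(pt):
    """Exactly one channel failed, it was detected, and no phantom fault was announced."""
    if pt["A_failed"] + pt["B_failed"] != 1:
        return False
    if pt["A_failed"]:
        phantom = pt["dAo"] or pt["dBs"]   # announcements against the healthy channel B
        real = pt["dAs"] or pt["dBo"]      # announcements against the failed channel A
    else:
        phantom = pt["dAs"] or pt["dBo"]
        real = pt["dBs"] or pt["dAo"]
    return bool(real) and not phantom


correct_points = [n for n in range(64) if is_correct_single_detection(decode(n))]


def isolation_probability(p_self, p_other):
    """P(fault in one channel is detected by its own test or by the other channel)."""
    total = 0.0
    for self_hit, other_hit in product((0, 1), repeat=2):
        if self_hit or other_hit:
            w_self = p_self if self_hit else 1.0 - p_self
            w_other = p_other if other_hit else 1.0 - p_other
            total += w_self * w_other
    return total


print(correct_points)
print(isolation_probability(0.95, 0.0))    # simplex isolating: no cross detection
print(isolation_probability(0.95, 0.95))   # cross isolating
```

Setting the cross-detection probability to zero recovers the simplex isolating case, in which the duplex isolation probability reduces to that of a single machine.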


Figure 1.- Triplex computer architecture. (Diagram: three channels linked by an interprocessor bus.)

Figure 2.- Markov state space model of triplex channel RCS. (Diagram: state labels include
$S_0 = X_1 X_2 X_3$, all channels operational; $S_1 = \bar{X}_1 X_2 X_3 + X_1 \bar{X}_2 X_3 + X_1 X_2 \bar{X}_3$,
two out of three channels operational; and $S_3$, system failure.)

Figure 3. (Plot: vertical axis, probability of system failure at 10 hr.)